AITopics | generalized smoothness

Error Feedback under (L0,L1)-Smoothness: Normalization and Momentum

Neural Information Processing SystemsJun-23-2026, 01:57:50 GMT

We provide the first proof of convergence for normalized error feedback algorithms across a wide range of machine learning problems. Despite their popularity and efficiency in training deep neural networks, traditional analyses of error feedback algorithms rely on the smoothness assumption that does not capture the properties of objective functions in these problems. Rather, these problems have recently been shown to satisfy generalized smoothness assumptions, and the theoretical understanding of error feedback algorithms under these assumptions remains largely unexplored. Moreover, to the best of our knowledge, all existing analyses under generalized smoothness either i) focus on single-node settings or ii) make unrealistically strong assumptions for distributed settings, such as requiring data heterogeneity, and almost surely bounded stochastic gradient noise variance. In this paper, we propose distributed error feedback algorithms that utilize normalization to achieve the O(1/ K)convergence rate for nonconvex problems under generalized smoothness. Our analyses apply for distributed settings without data heterogeneity conditions, and enable stepsize tuning that is independent of problem parameters. Additionally, we provide strong convergence guarantees of normalized error feedback algorithms for stochastic settings. Finally, we show that due to their larger allowable stepsizes, our new normalized error feedback algorithms outperform their non-normalized counterparts on various tasks, including the minimization of polynomial functions, logistic regression, and ResNet-20 training.

artificial intelligence, exp, machine learning, (17 more...)

Neural Information Processing Systems

Country: Asia > Middle East (0.45)

Genre: Research Report > Experimental Study (1.00)

Industry: Education (0.48)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.66)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.49)

Add feedback

Characterizing Online and Private Learnability under Distributional Constraints via Generalized Smoothness

Blanchard, Moïse, Shetty, Abhishek, Rakhlin, Alexander

arXiv.org Machine LearningFeb-25-2026

Understanding minimal assumptions that enable learning and generalization is perhaps the central question of learning theory. Several celebrated results in statistical learning theory, such as the VC theorem and Littlestone's characterization of online learnability, establish conditions on the hypothesis class that allow for learning under independent data and adversarial data, respectively. Building upon recent work bridging these extremes, we study sequential decision making under distributional adversaries that can adaptively choose data-generating distributions from a fixed family $U$ and ask when such problems are learnable with sample complexity that behaves like the favorable independent case. We provide a near complete characterization of families $U$ that admit learnability in terms of a notion known as generalized smoothness i.e. a distribution family admits VC-dimension-dependent regret bounds for every finite-VC hypothesis class if and only if it is generalized smooth. Further, we give universal algorithms that achieve low regret under any generalized smooth adversary without explicit knowledge of $U$. Finally, when $U$ is known, we provide refined bounds in terms of a combinatorial parameter, the fragmentation number, that captures how many disjoint regions can carry nontrivial mass under $U$. These results provide a nearly complete understanding of learnability under distributional adversaries. In addition, building upon the surprising connection between online learning and differential privacy, we show that the generalized smoothness also characterizes private learnability under distributional constraints.

adversary, artificial intelligence, machine learning, (16 more...)

arXiv.org Machine Learning

2602.20585

Country:

North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
North America > United States > California > Alameda County > Berkeley (0.04)
Asia > Middle East > Jordan (0.04)

Genre: Research Report (1.00)

Industry: Education (0.36)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Computational Learning Theory (1.00)

Add feedback

Gradient-Variation Online Learning under Generalized Smoothness

Neural Information Processing SystemsFeb-11-2026, 23:37:19 GMT

Gradient-variation online learning aims to achieve regret guarantees that scale with variations in the gradients of online functions, which is crucial for attaining fast convergence in games and robustness in stochastic o ptimization, hence receiving increased attention. Existing results often req uire the smoothness condition by imposing a fixed bound on gradient Lipschitzness, w hich may be unrealistic in practice. Recent efforts in neural network optim ization suggest a generalized smoothness condition, allowing smoothness to correlate with gradient norms. In this paper, we systematically study gradient-var iation online learning under generalized smoothness. We extend the classic optimi stic mirror descent algorithm to derive gradient-variation regret by analyzin g stability over the optimization trajectory and exploiting smoothness locally. Th en, we explore universal online learning, designing a single algorithm with the optimal gradient-va riation regrets for convex and strongly convex functions simultane ously, without requiring prior knowledge of curvature. This algorithm adopts a tw o-layer structure with a meta-algorithm running over a group of base-learners . To ensure favorable guarantees, we design a new Lipschitz-adaptive meta-a lgorithm, capable of handling potentially unbounded gradients while ensuring a second-order bound to effectively ensemble the base-learners. Finally, we provi de the applications for fast-rate convergence in games and stochastic extended adv ersarial optimization.

algorithm, artificial intelligence, machine learning, (16 more...)

Neural Information Processing Systems

Country:

Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Asia > China > Jiangsu Province > Nanjing (0.04)

Genre:

Research Report > Experimental Study (0.92)
Research Report > New Finding (0.68)

Industry: Education > Educational Setting > Online (1.00)

Technology:

Information Technology > Enterprise Applications > Human Resources > Learning Management (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

Add feedback

Convex and Non-convex Optimization Under Generalized Smoothness

Neural Information Processing SystemsDec-26-2025, 05:46:15 GMT

Classical analysis of convex and non-convex optimization methods often requires the Lipschitz continuity of the gradient, which limits the analysis to functions bounded by quadratics. Recent work relaxed this requirement to a non-uniform smoothness condition with the Hessian norm bounded by an affine function of the gradient norm, and proved convergence in the non-convex setting via gradient clipping, assuming bounded noise. In this paper, we further generalize this non-uniform smoothness condition and develop a simple, yet powerful analysis technique that bounds the gradients along the trajectory, thereby leading to stronger results for both convex and non-convex optimization problems. In particular, we obtain the classical convergence rates for (stochastic) gradient descent and Nesterov's accelerated gradient method in the convex and/or non-convex setting under this general smoothness condition. The new analysis approach does not require gradient clipping and allows heavy-tailed noise with bounded variance in the stochastic setting.

convex and non-convex optimization, generalized smoothness, name change, (5 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.61)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.61)

Add feedback

Gradient-Variation Online Learning under Generalized Smoothness

Neural Information Processing SystemsOct-10-2025, 00:37:41 GMT

Gradient-variation online learning aims to achieve regret guarantees that scale with variations in the gradients of online functions, which is crucial for attaining fast convergence in games and robustness in stochastic o ptimization, hence receiving increased attention. Existing results often req uire the smoothness condition by imposing a fixed bound on gradient Lipschitzness, w hich may be unrealistic in practice. Recent efforts in neural network optim ization suggest a generalized smoothness condition, allowing smoothness to correlate with gradient norms. In this paper, we systematically study gradient-var iation online learning under generalized smoothness. We extend the classic optimi stic mirror descent algorithm to derive gradient-variation regret by analyzin g stability over the optimization trajectory and exploiting smoothness locally. Th en, we explore universal online learning, designing a single algorithm with the optimal gradient-va riation regrets for convex and strongly convex functions simultane ously, without requiring prior knowledge of curvature. This algorithm adopts a tw o-layer structure with a meta-algorithm running over a group of base-learners . To ensure favorable guarantees, we design a new Lipschitz-adaptive meta-a lgorithm, capable of handling potentially unbounded gradients while ensuring a second-order bound to effectively ensemble the base-learners. Finally, we provi de the applications for fast-rate convergence in games and stochastic extended adv ersarial optimization.

algorithm, convex function, smoothness, (15 more...)

Neural Information Processing Systems

Country:

Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Asia > China > Jiangsu Province > Nanjing (0.04)

Genre:

Research Report > Experimental Study (0.92)
Research Report > New Finding (0.68)

Industry: Education > Educational Setting > Online (1.00)

Technology:

Information Technology > Enterprise Applications > Human Resources > Learning Management (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

Add feedback

Gradient-Variation Online Learning under Generalized Smoothness

Neural Information Processing SystemsMay-26-2025, 22:54:47 GMT

Gradient-variation online learning aims to achieve regret guarantees that scale with variations in the gradients of online functions, which is crucial for attaining fast convergence in games and robustness in stochastic optimization, hence receiving increased attention. Existing results often require the smoothness condition by imposing a fixed bound on gradient Lipschitzness, which may be unrealistic in practice. Recent efforts in neural network optimization suggest a generalized smoothness condition, allowing smoothness to correlate with gradient norms. In this paper, we systematically study gradient-variation online learning under generalized smoothness. We extend the classic optimistic mirror descent algorithm to derive gradient-variation regret by analyzing stability over the optimization trajectory and exploiting smoothness locally.

artificial intelligence, gradient-variation online learning, machine learning, (6 more...)

Neural Information Processing Systems

Industry: Education > Educational Setting > Online (0.93)

Technology:

Information Technology > Enterprise Applications > Human Resources > Learning Management (0.93)
Information Technology > Artificial Intelligence > Machine Learning (0.80)

Add feedback

Efficiently Escaping Saddle Points under Generalized Smoothness via Self-Bounding Regularity

Cao, Daniel Yiming, Chen, August Y., Sridharan, Karthik, Tang, Benjamin

arXiv.org Artificial IntelligenceMar-6-2025

In this paper, we study the problem of non-convex optimization on functions that are not necessarily smooth using first order methods. Smoothness (functions whose gradient and/or Hessian are Lipschitz) is not satisfied by many machine learning problems in both theory and practice, motivating a recent line of work studying the convergence of first order methods to first order stationary points under appropriate generalizations of smoothness. We develop a novel framework to study convergence of first order methods to first and \textit{second} order stationary points under generalized smoothness, under more general smoothness assumptions than the literature. Using our framework, we show appropriate variants of GD and SGD (e.g. with appropriate perturbations) can converge not just to first order but also \textit{second order stationary points} in runtime polylogarithmic in the dimension. To our knowledge, our work contains the first such result, as well as the first 'non-textbook' rate for non-convex optimization under generalized smoothness. We demonstrate that several canonical non-convex optimization problems fall under our setting and framework.

efficiently escaping saddle point, generalized smoothness, self-bounding regularity

arXiv.org Artificial Intelligence

2503.04712

Genre: Research Report (0.69)

Technology: Information Technology > Artificial Intelligence > Machine Learning (0.53)

Add feedback

Convex and Non-convex Optimization Under Generalized Smoothness

Neural Information Processing SystemsJan-19-2025, 10:49:23 GMT

Classical analysis of convex and non-convex optimization methods often requires the Lipschitz continuity of the gradient, which limits the analysis to functions bounded by quadratics. Recent work relaxed this requirement to a non-uniform smoothness condition with the Hessian norm bounded by an affine function of the gradient norm, and proved convergence in the non-convex setting via gradient clipping, assuming bounded noise. In this paper, we further generalize this non-uniform smoothness condition and develop a simple, yet powerful analysis technique that bounds the gradients along the trajectory, thereby leading to stronger results for both convex and non-convex optimization problems. In particular, we obtain the classical convergence rates for (stochastic) gradient descent and Nesterov's accelerated gradient method in the convex and/or non-convex setting under this general smoothness condition. The new analysis approach does not require gradient clipping and allows heavy-tailed noise with bounded variance in the stochastic setting.

convex and non-convex optimization, generalized smoothness, smoothness condition, (3 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.65)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.65)

Add feedback

Error Feedback under $(L_0,L_1)$-Smoothness: Normalization and Momentum

Khirirat, Sarit, Sadiev, Abdurakhmon, Riabinin, Artem, Gorbunov, Eduard, Richtárik, Peter

arXiv.org Artificial IntelligenceOct-22-2024

We provide the first proof of convergence for normalized error feedback algorithms across a wide range of machine learning problems. Despite their popularity and efficiency in training deep neural networks, traditional analyses of error feedback algorithms rely on the smoothness assumption that does not capture the properties of objective functions in these problems. Rather, these problems have recently been shown to satisfy generalized smoothness assumptions, and the theoretical understanding of error feedback algorithms under these assumptions remains largely unexplored. Moreover, to the best of our knowledge, all existing analyses under generalized smoothness either i) focus on single-node settings or ii) make unrealistically strong assumptions for distributed settings, such as requiring data heterogeneity, and almost surely bounded stochastic gradient noise variance. In this paper, we propose distributed error feedback algorithms that utilize normalization to achieve the $O(1/\sqrt{K})$ convergence rate for nonconvex problems under generalized smoothness. Our analyses apply for distributed settings without data heterogeneity conditions, and enable stepsize tuning that is independent of problem parameters. Additionally, we provide strong convergence guarantees of normalized error feedback algorithms for stochastic settings. Finally, we show that due to their larger allowable stepsizes, our new normalized error feedback algorithms outperform their non-normalized counterparts on various tasks, including the minimization of polynomial functions, logistic regression, and ResNet-20 training.

artificial intelligence, exp, machine learning, (16 more...)

arXiv.org Artificial Intelligence

2410.16871

Country:

North America > United States > California > San Diego County > San Diego (0.04)
Asia > China (0.04)

Genre:

Research Report > New Finding (0.48)
Research Report > Experimental Study (0.34)

Industry: Education (0.34)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.66)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.49)

Add feedback

Near-Optimal Non-Convex Stochastic Optimization under Generalized Smoothness

Liu, Zijian, Jagabathula, Srikanth, Zhou, Zhengyuan

arXiv.org Artificial IntelligenceOct-27-2023

The generalized smooth condition, $(L_{0},L_{1})$-smoothness, has triggered people's interest since it is more realistic in many optimization problems shown by both empirical and theoretical evidence. Two recent works established the $O(\epsilon^{-3})$ sample complexity to obtain an $O(\epsilon)$-stationary point. However, both require a large batch size on the order of $\mathrm{ploy}(\epsilon^{-1})$, which is not only computationally burdensome but also unsuitable for streaming applications. Additionally, these existing convergence bounds are established only for the expected rate, which is inadequate as they do not supply a useful performance guarantee on a single run. In this work, we solve the prior two problems simultaneously by revisiting a simple variant of the STORM algorithm. Specifically, under the $(L_{0},L_{1})$-smoothness and affine-type noises, we establish the first near-optimal $O(\log(1/(\delta\epsilon))\epsilon^{-3})$ high-probability sample complexity where $\delta\in(0,1)$ is the failure probability. Besides, for the same algorithm, we also recover the optimal $O(\epsilon^{-3})$ sample complexity for the expected convergence with improved dependence on the problem-dependent parameter. More importantly, our convergence results only require a constant batch size in contrast to the previous works.

algorithm, lemma 4, near-optimal non-convex stochastic optimization, (12 more...)

arXiv.org Artificial Intelligence

2302.06032

Country: